Phoneme - Viseme Mapping for German Video - Realistic Audio - Visual - Speech - Synthesis IKP - Working Paper NF 11

نویسندگان

  • Bianca Aschenberner
  • Christian Weiss
چکیده

In this working paper we introduce a German viseme set which we already use in our data-driven audio-visual synthesis system. The viseme set is essential for speech driven audio-visual synthesis due to the fact that the selection of appropriate video segments is based on the visemically transcribed input text. For text-to-speech synthesis, a transcription of the input text into the phonemic representation is used, in order to avoid ambiguous meanings and to acquire the correct pronunciation of the underlying input text. The transcription also serves as label in unitselection based synthesis systems. Likewise, the visual synthesis requires a transcription that represents analogue to the phonemes, the visual counterpart which is called visemes in related literature and serves also as unit label in our data-driven video-realistic synthesis system. A viseme could include several phonemes, since we can not for example visually differentiate between pronouncing a /b/ or a /p/. Thus, both sounds are bilabial plosives and consequently, have the same lip-based realisation. Therefore, we worked out an inventory of German viseme classes and developed a phoneme-viseme mapping for German that we currently use in our data-driven exemplary audio-visual synthesis system. The visemes are represented in a SAMPA-like labelling. For transcribing a large corpus, we trained a maximum entropy model for automatic visemic transcription and thus received feasible results.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A German viseme-set for automatic transcription of input text used for audio-visual speech synthesis

In this paper, we introduce a German viseme inventory for visemically transcribing text according to phonetic transcribtion. A viseme set like the one presented in this work is essential for speech-driven audio-visual synthesis due to the fact that the selection of appropriate video segments is based on the visemically transcribed input text. For text-to-speech synthesis, a transcription of the...

متن کامل

Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis

The use of visemes as atomic speech units in visual speech analysis and synthesis systems is well-established. Viseme labels are determined using a many-to-one phoneme-to-viseme mapping. However, due to the visual coarticulation effects, an accurate mapping from phonemes to visemes should define a many-to-many mapping scheme. In this research it was found that neither the use of standardized no...

متن کامل

Decoding visemes: improving machine lipreading (PhD thesis)

This thesis is about improving machine lip-reading, that is, the classification of speech from only visual cues of a speaker. Machine lip-reading is a niche research problem in both areas of speech processing and computer vision. Current challenges for machine lip-reading fall into two groups: the content of the video, such as the rate at which a person is speaking or; the parameters of the vid...

متن کامل

Phoneme-to-viseme Mapping for Visual Speech Recognition

Phonemes are the standard modelling unit in HMM-based continuous speech recognition systems. Visemes are the equivalent unit in the visual domain, but there is less agreement on precisely what visemes are, or how many to model on the visual side in audio-visual speech recognition systems. This paper compares the use of 5 viseme maps in a continuous speech recognition task. The focus of the stud...

متن کامل

Automatic Viseme Vocabulary Construction to Enhance Continuous Lip-reading

Speech is the most common communication method between humans and involves the perception of both auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signals, but it has been demonstrated that video can provide information that is complementary to the audio. Thus, the study of automatic lip-reading is important and is still an open problem. One of the ke...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005